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Multidimensional  Contingency  Tables 

Leonardo  D.  Epstein  Stephen  E.  Fienberg 

Department  of  Biostatistics  Department  of  Statistics 

The  Johns  Hopkins  University  Carnegie  Mellon  University 

Baltimore,  Md.  21205  Pittsburgh,  Pa.  15213 


Abstract 

This  paper  discusses  a method  suggested  by  Epstein  and 
Fienberg  (1991)  for  the  Bayesian  analysis  of  multidimen- 
sional contingency  tables  in  connection  with  the  Gibbs 
sampler  to  calculate  posterior  densities. 

The  method  consists  of  a two-stage  hierarchical  prior. 
The  first  stage  is  a Dirichlet  distribution  with  a loglin- 
ear  reparametrization  for  its  means.  The  second  stage  is 
a multivariate  normal  distribution  on  the  loglinear  pa- 
rameters. However,  other  distributions  can  be  used  if 
the  Dirichlet-normal  combination  is  not  flexible  enough 
to  accomodate  one’s  prior  beliefs. 

These  prior  distributions  are  useful  when  one  believes, 
with  uncertainty,  in  a given  loglinear  structure  for  the 
cell  probabilities. 

Key  words:  Contingency  tables;  Bayesian  estimation; 
Dirichlet  prior  distribution;  Gibbs  sampler;  Loglinear 
model;  Maximum  likelihood  estimation  of  Dirichlet  dis- 
tributions. 

1 Introduction 

A new  Bayesian  method  for  the  analysis  of  multidimen- 
sional contingency  tables  was  recently  proposed  by  Ep- 
stein and  Fienberg  (1988)  and  Epstein  (1990).  As  with 
many  other  Bayesian  methods,  ours  uses  the  posterior 
means  of  the  cell  probabilities  to  estimate  these  parame- 
ters. The  focus  on  posterior  means  is  in  part  due  to  the 
importance  of  point  estimation  and  in  part  due  to  com- 
putational difficulties  in  drawing  further  inferences  from 
the  posterior  The  purpose  of  this  article  is  to  illustrate 
with  an  example  how  to  use  the  Gibbs  sampler  to  com- 
pute estimates  of  the  posterior  densities  that  arise  from 
our  method.  These  density  estimates  are  readily  inte- 
grate to  compute  posterior  probabilities  and  moments. 

We  introduce  the  esentials  of  the  metb  A via  a simple 


example.  Suppose  our  interest  is  on  inferences  about  the 
array  of  cell  probabilities  8 = {0,j } of  a 2 x 2 contingency 
table  and  suppose  also  that  given  8,  the  observed  counts 
x = {x1; } follow  a multinomial  distribution  M(N,8). 
We  label  the  two  factors  by  1 and  2. 

When  the  data  follow  a multinomial  distribution  to 
model  prior  beliefs  it  is  common  to  use  the  conjugate 
Dirichlet  prior  D(A',rj)  with  density 

[0|/Gr,]=^nCJ'1- 

where  /?  = r(A')/n„  and  rj  = 

Before  the  observation  of  x we  might  believe  with  some 
uncertainty  that  the  two  factors  are  independent.  That 
is,  we  might  believe  that  8 satisfies  0,;-  = 0,+0+J  , * = 
1,2,  j = 1,2,  with  some  degree  of  uncertainty. 

The  condition  = 0,+0+j  is  equivalent  to 

log(?0  =u  + u1(i;)  + u2(0).  (1) 

with  the  restriction  that  the  term  u in  this  equation  is 

« = ~ + «2 (ij)h  (2) 

so  that  j 9ij  = 1.  This  normalization  leads  to  an 
equivalent  parametrization  that  uses  the  multivariate 
logits,  i.e. , if 


then  the  7 are  the  multivariate  logits  (see  Leonard  and 
Novick,  1976).  The  parametrization  (1)  and  the  normal- 
izing condition  on  u are  equivalent  to  reparametrizing 
{7i>}  using 

7.j  = «i(ij)  + “2  (0)- 

Unless  neccessary,  the  remainder  of  this  paper  onuts 
explicit  reference  to  the  normalizing  role  of  u.  Thus, 
we  will  simply  speak  of  the  loglinear  parametrization 
log  @i j = U+  Ul(,j  ) + «2(,j  ) 
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To  see  that  the  parametrization  (1)  is  equivalent  to 
independence,  substitute  the  value  of  u back  in  equation 
( 1 ) to  get 


eux>j)  eu3(u> 

y,  eUa(*j> 


Hence 


0i+ 


eui(w) 

E,£U,(,J) 


and  6j+  = 


euJ(o) 


To  incorporate  in  the  prior  our  uncertain  belief  in  in- 
dependence, Epstein  (1990)  and  Epstein  and  Fienberg 
(1991)  proposed  using  a loglinear  parametrization  on  the 
Dirichlet  means.  That  is,  to  reflect  the  plausibility  that 
the  cell  probabilities  satisfy  (1)  they  suggest  using 


log'lij  — u + ul(ij)  + u2  (i;)i  (3) 


with  u = - log(Eij  exP(ui(0)  + «2 (O'))- 

The  index  “1”  in  ui(jj)  indicates  that  this  u-term  de- 
pends only  on  the  index  i.  It  is  more  common  to  omit 
the  indices  on  which  the  u-terms  do  not  depend.  Thus 
often  we  write  ui(,)  and  u2(;)  instead  of  ui(ij)  and  u2(y), 
but  the  fact  that  } and  {u2(ij)}  are  arrays  of  the 

same  dimensions  as  { rjij  } simplifies  many  formulas. 

To  establish  the  connection  with  the  multidimen- 
sional case,  wc  note  that  parametrization  (3)  maps 
an  array  {7^}  belonging  to  the  linear  subspace 
M = {{7,7}  : 7 ,7  = u1(i)  + u2(j)}  into  the  array 

{exp(7o)/E,;  exp(7y)}. 

The  parametrization  (3)  implies  that 


W+  - 


E,eu,"J) 


and  t)+j 


C U3('J) 

E;  e<il<,,)  ' 


(4) 


Thus,  { « | ( ) } and  {u2(y)}  parametrize  the  marginal  ar- 
rays {*?,+  } and  {*/+/},  respectively. 

We  follow  the  notation  of  Andersen  (1974)  to  repre- 
sent marginal  tables,  and  the  definition  will  be  recalled 
in  section  2 more  formally.  This  notation  represents  the 
marginal  array  with  entries  77*+  by  tjy  = {*7,7},  where 
Y = {1}.  The  set  of  factor  labels  Y indicates  that  17' 
depends  only  on  the  index  corresponding  to  factor  1, 
namely  i,  and  that  tj  was  collapsed  over  the  indices  cor- 
responding to  the  factors  not  in  Y , namely  j.  We  will 
also  use  products  of  arrays.  Thus,  for  example,  the  prod- 
uct of  and  denoted  by  is  the  array 

whose  (i,j)  entry  is  or,  in  the  usual  notation, 

it+n+j 

The  parametrization  log  17,7  = u + u1(lj)  + «2(i>)  *s 
equivalent  to  77,7  = with  0 < rjjp  < 1, 

0 < r/J }j2)  < 1,  17}] } + i}[) 1 = 1,  and  t}{^]  + 17^ 1 = 1 


(see  Albert  and  Gupta,  1982).  However,  the  loglinear 
parametrization  on  the  Dirichlet  means  allowed  Epstein 
and  Fienberg  (1991)  and  Epstein  (1990)  to  extend  the 
method  to  multidimensional  tables. 

If  we  feel  we  cannot  specify  a value  for  tjjp  and  t]fp, 
or,  equivalently,  for  tii(ij)  and  u2 (y>,  then  the  Dirichlet 
distribution  cannot  adequately  represent  our  prior  be- 
liefs. However,  as  Albert  and  Gupta  (1982)  point  out, 
the  Dirichlet  distribution  may  still  be  used  as  the  first 
stage  of  a two-stage  prior.  With  a loglinear  parametriza- 
tion for  the  Dirichlet  means  there  are  two  equivalent  al- 
ternative ways  to  '■omplete  the  two-stage  prior.  One  may 
use  distributions  on  the  u-terms  or  one  may  prefer  to 
specify  distributions  on  rjjj  ^ and  77 \jp  directly. 

The  loglinear  parametrization  will  be  more  useful 
when  analyzing  tables  of  higher  dimensions  where  one 
may  consider  more  complex  loglinear  structures.  As  the 
next  section  explains,  with  loglinear  parametrizations  for 
71-way  tables  one  can  also  specify  the  second-stage  in  two 
alternative  ways,  but  to  use  the  second  one  must  deter- 
mine the  generating  class  of  the  loglinear  parametriza- 
tion and  use  the  margins  of  tj  given  by  the  generator  as 
parameters  of  the  Dirichlet  distribution. 

The  parameter  K governs  the  concentration  of  the 
prior  distribution  about  the  independence  surface 

5 = mj}\oij=ei;]9}p, 

0 < 0,7 1 < 1,0  < ejp  < 1, 

0!!,+^)  = i,^)+0l22)  = i}. 


In  the  limit,  as  K — 00,  the  prior,  and  therefore  the 
posterior,  concentrate  all  of  their  mass  on  S. 

When  we  use  a two-stage  prior  we  obtain  the  posterior 
means 


e(6ij\x)  = 


N 


K 


N + K N N + K 


7e(ni}]vip\x), 


(5) 


which  we  use  to  estimate  9.  The  expectation 
£(TJiPTliP  I*)  *s  with  respect  to  the  distribution  induced 
on  17! '1  and  r/l2l  through  equations  (4) 

In  most  practical  situations,  when  K — ► 0 the  poste- 
rior means  e(0ij  |x)  converge  to  the  observed  proportions 
Xij/N . When  K — ► 00  not  only  the  posterior  distribu- 
tion concentrates  the  all  of  its  mass  on  S , but  the  pos- 
terior mean  e(0|z)  itself  belongs  to  S.  This  property 
translates  into 


Urn  £(0,7|x)  = hm/Kjl*)  * Um  *(*$ }|*)- 


ic-o 


K— *0 


It  shows  that  the  estimates  corresponding  to  increasing 
values  of  K reflect  an  increasingly  strong  prior  belief  in 
the  plausibility  of  independence  of  the  two  factors  by 
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compromising  between  estimates  obtained  under  a sat- 
ura'ed  model  and  estimates  obtained  under  an  indepen- 
dence model.  Epstein  (1990)  showed  that  this  property 
holds  for  general  loglinear  parametrizations. 

With  this  introductory  example  it  is  now  easy  to  see 
how  our  approach  extends  to  tables  of  higher  dimension. 
If  we  believe,  with  uncertainty,  in  a given  loglinear  struc- 
ture for  the  cell  probabilities,  we  use  a two-stage  prior. 
In  the  first  stage  use  a Dirichlet  distribution  with  means 
having  the  same  loglinear  structure.  In  the  second  stage 
use  distributions,  Gaussian  for  example,  on  the  u-terms 
of  the  loglinear  parametrization. 

In  the  introductory  example  we  speak  of  independence 
being  a plausible  structure  for  the  cell  probabilities  to 
indicate  that  we  believe  in  independence  only  to  a certain 
degree.  In  general,  we  will  speak  of  a plausible  loglinear 
structure  to  indicate  that  we  believe  in  that  structure 
only  to  a certain  degree. 

In  the  multidimensional  case,  as  A’  — > oo,  the  prior 
and  therefore  the  posterior  concentrate  all  their  mass 
in  the  subset  of  arrays  T)  defined  by  the  loglinear 
parametrization.  Epstein  (1990)  studied  properties  of 
the  posterior  means  as  estimators  when  the  loglinear 
parametrization  on  rj  is  hierarchical. 

The  next  section  reviews  the  extension  of  the  method 
for  multidimensional  tables  and  the  basic  elements  of 
loglinear  parametrizations. 

Section  3 presents  our  implementation  of  the  Gibbs 
sampler.  The  implementation  requires  finding  maximum 
likelihood  estimates  for  Dirichlet  means  under  a loglinear 
parametrization.  Subsection  3.1  describes  the  use  of  the 
projection  gradient  method  to  compute  these  maximum 
likelihood  estimates.  Additionally,  section  3 discusses 
a rejection-acceptance  scheme  to  draw  deviates  from  a 
posterior  distribution  that  does  not  require  the  marginal 
( predictive  ) distribution.  Section  4 illustrates  the  im- 
plementation of  the  Gibbs  sampler  and  the  method  of 
Epstein  and  Fienberg  ( 1991 ) with  simple  sociological  ex- 
ample concerning  student  politics  and  family  structure. 

2 A Bayesian  Method  for  Multi- 
dimensional Tables 

In  this  section  we  review  the  method  proposed  by  Ep- 
stein (1990)  and  Epstein  and  Fienberg  (1991)  for  mul- 
tidimensional tables.  We  refer  the  reader  to  Epstein 
(1990)  for  proofs  and  a detailed  discussion  of  this  sec- 
tion’s results. 

Following  the  notation  of  Andersen  (1974),  consider 
n factors  or  treatments  labeled  1, 2, . . . , n,  with  factor  i 
having  r,  levels.  Define  r,  = {!,...,  r, } and  call  it  the 


set  of  levels  of  factor  i.  The  set  / = fi  x • • • x rn,  is 
usually  refered  to  as  the  ind cx  set  or  the  set  of  cells. 

A selection  of  levels  < = (*i , *2 > • • • . *n).  a generic  el- 
ement in  /,  is  often  referred  to  as  the  (i'i,  ij, . . . , in)- 
cell.  One  obtains  a iq  x ■ • • x rn  contingency  table 
* = {xt,i  £ /}  when  N individuals  are  examined  and 
cross-classified  according  'o  the  levels  of  each  of  the  fac- 
tors. 

We  shall  assume  that  x = {x,,i  £ /}  has  a multino- 
mial A/(N,{0t})  distribution,  where  6 , is  the  probability 
of  an  individual  being  classified  in  cell  t.  However,  the 
method  easily  adapts  to  other  sampling  distributions, 
such  as  Poisson  and  product  multinomial  (Bishop,  F’ien- 
berg,  and  Holland,  1975). 

In  the  first  stage  use  a Dirichlet  D(I\,i])  distribution 
with  density 

(6) 

<e/ 

indexed  by  r;  = £ /},  and  where  /?  = t-t  1 — -. 

1 l.cr 1 (Ar,D 

The  parameter  I\  > 0 is  prespecified.  Thus,  e(0t  \ K,  ij)  = 
T)t  for  t g I. 

Let  w C n,  i.e. , w is  a set  of  factor  labels.  We  shall 
denote  u„  the  interaction  parameter  among  the  factors 
in  w.  More  specifically,  the  interaction  uw  is  the  n x 
. . . x rn  array 

uw  = {**„,(,,  .»„)}> 

where  the  entries  ,n)  of  u„,  depend  only  upon  the 

indices  with  j £ w.  Often  the  interactions  are  taken 
to  satisfy  the  usual  ANOVA  constraints,  i.e.,  the  sum 
of  the  entries  uW(ii, over  the  levels  of  any  factor 
j £ w is  zero.  These  constraints  achieve  idcntifiabil- 
ity  of  the  parametrization.  The  Bayesian  approach  does 
not  require  identifiable  parametrizations  and  therefore 
we  need  not  use  constraints.  Their  use,  however,  is  not 
precluded.  One  should  use  them  whenever  they  facilitate 
producing  a prior  distribution  reflecting  one’s  beliefs. 

Loglinear  parametrizations  are  usually  used  for  the 
multinomial  parameters.  The  model  defined  by 

log®  = ^ uw,  (7) 

is  the  saturated  or  unrestricted  model.  Whenever  a vec- 
tor, x say,  appears  as  the  argument  of  a real  function  of 
one  variable,  / say,  then  /(*)  shall  stand  for  the  vector 
(/(*  l) /(*!))'• 

The  entries  of  the  array  U|,  where  0 is  the  empty  set, 
are  all  the  same.  The  term  ut  is  usually  referred  to  as 
the  constant  term. 
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In  the  general  case  we  can  use  the  multivariate  logits 
7,  by  writing: 


1 lie  parametrization  (7)  is  equivalent  to 

1=  52  «*. 

u»  C 

We  obtain  submodels  by  including  only  some  interac- 
tions in  the  formula  above.  To  specify  which  interactions 
we  include  in  a submodel,  we  use  a class  of  subsets  of  n 
which  we  call  A.  For  example,  we  write 

log#  = 52  u<" ' 

u>£A 

to  specify  a parametrization  which  only  includes  the  in- 
teractions among  factors  in  w,  with  w E A. 

We  are  concerned  with  making  inferences  when  we  feel 
it  is  plausible  that 

log#  = 52  «um  (8) 

where  A is  a strict  subset  of  n.  To  incorporate  this 
belief  into  the  prior  we  suggest  that  instead  of  using  the 
loglinear  parametrization  on  # we  use  it  on  the  Dirichlet 
means,  that  is, 

•og t;  = (9) 

U’6^ 

This  restricts  log  ?/  to  lie  in  a linear  subspace  M of 
To  ensure  that  the  parametrization  is  such  that 
£,£/#«  = 1.  it  is  necessary  to  assume  that  M contains 
the  array  1 whose  entries  are  all  1.  In  the  introductory 
example  the  class  A is  { { 1 } , {2} } and  therefore  equation 
(9)  becomes 

log  VlJ  = «+  «1(,.,)  + 1/2(0)- 

In  the  parametrization  (9)  the  term  Uy  = {u}  must 
satisfy 

u --  -log(^exp(  52  “«-(.))). 

*€/  w£A,w£$ 

so  that  ]T]i6/7/,  = !•  term  u,  in  (8)  must  satisfy 
this  restriction  as  well.  The  restriction  on  u»  will  remain 
implicit  whenever  we  refer  to  parametrizations  such  as 
those  in  (8)  and  (9) 

In  summary,  we  suggest  a two-stage  hierarchical  prior. 
The  first  stage  consists  of  setting  # ~ D(I\,ij)  where 
is  parametrized  using  (9)  The  second  stage  consists  of 
setting  distributions  on  the  u-terms  in  (9). 


As  a consequence  of  using  the  parametrization  (9)  one 
can  specify  a value  for  r)  by  specifying  values  for  some 
margins  of  rj  . For  example,  iflogfyj  = u+Ui(0)+u2(0)’ 
then  rjij  = T)i+T)+j.  In  this  fashion  we  specify  the  rc 
values  rjjj  by  specifying  values  for  ij <+  and  rj+j,  a total 
of  only  r + c values. 

This  result  extends  to  the  general  case.  When  the  log- 
linear  parametrization  (9)  is  hierarchical  then  r]  is  to- 
tally specified  by  the  value  of  the  margins  r/y' , . . . , t]Yt  , 
where  {Y) , . . . , > 7-}  is  the  generating  class  of  the  loglinear 
parametrization.  Therefore,  we  can  implement  the  sec- 
ond stage  either  by  using  distributions  on  the  u-terms  or 
by  using  distributions  on  the  margins  i)Y> , . . . , Tjlr . For 
Y C n the  Y-margin  is  defined  as  being  the  array 
whose  entries  are 

= 52  ’»  = 52  -in- 

n\v  1,6/7, jen\v 

3 Implementation  of  the  Gibbs 
Sampler 

This  section  describes  the  specifics  of  the  implementa- 
tion of  the  Gibbs  sampler  for  calculating  the  posterior 
densities  of  the  cell  probabilities. 

We  start  with  a brief  review  of  the  Gibbs  sampler  and 
refer  the  reader  to  Gelfand  et  al.  (1990)  and  Gelfand 
and  Smith  (1990)  for  a detailed  description  of  the  use  of 
Gibbs  sampling  in  Bayesian  inference. 

Suppose  that  one  wishes  to  estimate  the  density  [A’] 
of  the  random  variable  Ar  assuming  it  is  possible  to  draw 
deviates  from  the  conditional  densities  [A|Y]  and  [Y|A], 
where  Y is  another  random  variable. 

The  algorithm  consists  of  iteratively  repeating  a two- 
step  cycle.  Before  starting  one  draws  a deviate  A*0* 
from  an  arbitrary  density  [A’]o.  Step  one  of  the  cy- 
cle is  to  draw  a deviate  V ^ 1 1 from  [V|A’(°*].  Step 
two  is  to  draw  A'(1)  from  [A'|Y(1)].  Then  one  first 
replaces  A'10’  by  A'(1)  and  proceeds  with  the  second 
cycle.  A succession  of  cycles  produces  a sequence 
(A*1*,  Y(1)),  (A'(2),  Y*2)), . . . , (A'(,),  Y^) The  se- 

quence A’*’1  converges  in  distribution  to  A ~ [A]  and 
Y<°  converges  in  distribution  to  Y ~ [Y], 

Gelfand  and  Smith  (1990)  suggest  building  an  esti- 
mate of  the  density  [A']  as  follows.  Using  the  Gibbs 
sampler,  obtain  m independent,  replicates  (A'j1',  Y/1^), 

. . . , (Am\  Ym  V With  these  deviates  obtain  the  density 
estimate 

m 

(A]  = m-l^(A|Y;"]. 
j = i 


(10) 
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We  use  Gibbs  sampling  to  estimate  the  posterior  dis- 
tribution [0|x].  For  the  two-stage  prior  8 ~ [0)77]  and 
i j ~ [»/].  We  identify  X with  8 and  Y with  77  and  use 
the  following  Gibbs  scheme  to  draw  deviates  from  the 
posterior  [0|z]: 

do  j = 1 , m 
Set  8{0) 
do  i = 1 , t 

Step  1:  draw  77  from  [77 10^] 

Step  2:  draw  0( from  [8\i],x] 

0(°)  _ q( i) 

end  do 


end  do 

On  exit,  this  process  has  generated  m independent  de- 
viates i/p  ~ [77^],,/  = and  rn  independent 

deviates  9 ~ [0*(*],  j = 1, . . . , m. 

With  these  deviates,  the  density  estimate  in  equation 
(10)  is  a finite  mixture  of  Dirichlet  densities, 

m 

It  is  particularly  simple  to  evaluate  the  marginal  den- 
sity estimate  of  a cell  probability.  For  example,  in  the 
situation  of  the  introductory  example, 

m 

[01,1*]  = m_1  X^iil1/;0.*].  (H) 

j= i 

where  [0\i\i],x]  is  the  beta(  A’D/i, , A'*(l  — 77^))  density 
with  l\ * = N + K,  r)‘u  = <*7711  -f  (1  — q)xh/N,  and 
a = K/(N  + l<). 

Automatically  monitoring  convergence  is  still  an  open 
issue;  at  present  the  best  one  can  do  is  to  prespecify  the 
total  number  of  iterations  t,  say. 

The  distribution  [8\ij,x],  which  is  used  in  step  two 
of  a Gibbs  sampler  cycle,  is  Dirichlet  with  concentra- 
tion parameter  A'*  = (N  + A)  and  cell  means  rj"  = 
A’/ ( N + l\)i)t  + l/(N  + A’  )x, . Drawing  deviates  from 
the  distribution  (0|»7,  x]  is  straightforward.  We  chose 
to  generate  these  deviates  by  independently  generating 
7,  ~ Gamma(p,,  1)  ,t  € / with,  p,  = A''r/* , and  then 
setting  0,  = 7i/ZIi'€/ T'j'-  The  joint  distribution  of  {9,} 
is  Dirichlet  D(A’*,{g,})  with  </,  = p,/A'*. 

However,  drawing  deviates  from  the  distribution 
M0,o>]  = [fl(0)l»?]  M/[0(O)],  used  in  step  one  of  a cy- 
cle, is  not  straightforward. 

We  suggest  the  following  adaptation  of  the  rejection 
method  to  sample  from  [t/|0*O)].  This  adaptation  uses 


deviates  77  from  [77],  which  are  easy  to  generate,  to  obtain 
deviates  from  [t7|0^]  = [0^|»?]  [»?]/[^f °'1] • 

accept  <—  false 

do  while  ( not(  accept  ) ) 

generate  a deviate  77  from  [17] 
generate  a deviate  v from  (/[0,  B ] 
if  ( v < [0^|»?]  ) then  accept  <—  true 
end  do 

Above,  B is  sucli  that  B > [0in)|»7]  for  all  tj  in  its 
domain.  It  is  simple  to  show  that  an  accepted  77  is 
a deviate  from  [77(0^].  An  important  feature  of  this 
approach  is  that  it  does  not  require  the  calculation  of 
[010'],  or  an  estimate  of  it,  as  is  sometimes  necessary 
in  some  implementations  of  the  Gibbs  sampler  ( see  for 
example  Gelfand  and  Smith,  1990). 

A geneneralized  rejection  method  that  uses  an  en- 
veloping function  B(i])  for  77  — > [7j|0]  may  increase  the 
speed  of  this  algorithm.  At  present,  however,  we  will 
content  ourselves  with  a boxed  envelop,  the  main  advan- 
tage being  the  ease  of  programming.  Obtaining  a good 
value  for  B is  crucial  for  a good  performance  of  the  re- 
jection method.  The  ideal  choice  is  to  find  77  such  that 

[0(O)|»>]  = max{[0(o,|T7]  : logT7=  uB), 

and  then  take  B = [0^|t/].  Observe  that  77  — ► [0^O)|t7], 
is  the  Dirichlet  likelihood  function  given  the  data  0(o^. 

The  next  subsection  introduces  a maximization  proce- 
dure to  find  B.  The  procedure  appears  to  be  fast  enough 
to  use  it  in  combination  with  the  Gibbs  sampler. 

Observe  that  under  the  loglinear  parametrization  7 = 
log  77  = YlweA  uum  7 = l°g»/  is  the  maximum  likelihood 
estimate  of  7 . 

3.1  Maximizing  the  Dirichlet  Likelihood 

In  this  section  we  briefly  describe  the  “gradient  projec- 
tion method”  and  apply  it  to  maximize  the  Dirichlet 
loglikelihood.  In  addition  to  being  easy  to  implement, 
various  features  of  the  Dirichlet  likelihood  and  loglinear 
parametrizations  make  the  gradient  projection  method 
preferable  to  other  methods.  We  discuss  the  advantages 
of  the  gradient  projection  method  after  introducing  ad- 
ditional definitions. 

Recall  that  a loglinear  parametrization  for  77  restricts 
log77  to  lie  in  a linear  subspace  A/.  The  usual  form 
of  writing  a loglinear  parametrization  with  u-terms  ex- 
presses 7 £ M in  terms  of  a basis  matrix  of  A/,  i.e.,  a 
matrix  B whose  columns  form  a basis  for  A/.  When  ex 
pressing  a vector  7 £ A/  in  terms  of  the  unique  u such 
that  7 = Bu,  the  coordinates  of  u are  the  u-terms. 
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To  avoid  technical  complications  that  the  restriction 
]P,e/  //,  = 1 introduces,  we  redefine  some  functions  of »/ 
as  functions  of  jj,,t  ^ (r[,...,r„)  only.  To  this  efTect, 
for  tj  given  we  define  rj  as  r),  = ry,,  i £ / with  / = 
I — {(ri , . . . , rn)}  as  the  index  set  for  the  vectors  fj. 

Maximizing  the  Dirichlet  likelihood  /(*](#)  is  equiva- 
lent to  maximizing 

£■(»/)  = A'  < t7,  A > - IogF(A'v,), 

‘ei 

where  A = log#  and  j/  is  given  by  i),  = ry,  ,r  g / and 
7/(ri  rn)  = Fxcept  for  an  additive  constant, 

£’(?/)  is  log/(»/|0). 

The  Dirichlet  means  corresponding  to  the  multivari- 
ate logits  7 are  given  by  »y  = H(-/)  with  r y,  = 

exp(7,)/  YL,'ei  exP(7t'),  t £ /■  To  use  the  paramctriza- 
tion  with  the  multivariate  logits  it  is  convenient  to  define 

<7(7)=E(//(7)).  (12) 

Observe  that  if  we  use  the  pararnetrization  with  the  u- 
terms,  then  we  may  find  u,  the  m.I.e.  of  u , by  maxi- 
mizing U (u)  = G(Hu). 

Roughly  speaking,  there  are  three  classes  of  alter- 
native' methods  to  maximize  U.  One  possibilty  is  to 
solve  the  equation  JU(u)  = 0,  where  IV  stands  for 
the  array  of  partial  derivatives  of  U . Typically,  itera- 
tive procedures  to  solve  this  equation  require  updating 
an  estimate  of  the  Hessian  of  U after  some  iterations. 
On  the  one  hand,  it  is  difficult  to  obtain  formulas  for 
the  second  derivatives  of  U and  on  the  other,  comput- 
ing second  derivatives  numerically  is  in  general  expen- 
sive and  roundoff  errors  are  difficult  to  control.  Since 
U(n)  = G(Bu ),  this  approach  poses  the  additional  dif- 
ficulty of  explicitly  requiring  a basis  matrix  for  M 

An  alternative  is  to  use  a steepest  ascent  method 
where  at  each  step  there  is  a unidimensional  search  along 
the  direction  JU(u).  This  alternative  also  requires  a 
basis  matrix.  In  fact,  any  method  that  uses  u as  the 
variable  of  the  objective  function,  will  require  a basis 
matrix. 

The  gradient  projection  method  is  preferable  to  these 
alternatives  because  it  does  not  require  estimating  Hes- 
sians or  a basis  matrix  of  M . Moreover,  the  gradient 
projection  method  allows  us  to  take  advantage  of  the 
ANOVA-type  pararnetrization  for  7 to  perform  certain 
computations  more  efficiently. 

To  use  the  gradient  projection  method  we  view  the 
problem  of  maximizing  the  Dirichlet  likelihood  as  the 
problem  of  finding  7 £ M such  that 

(7(7)  = max{<7(7),7  € M), 


which  is  a constrained  maximization  problem.  A point 
7 £ M is  refered  to  as  a “feasible  point”.  The  gradient 
projection  method  projects  the  gradient  of  the  objective 
function  onto  M to  increase  the  value  of  G(~i)  and  to 
maintain  feasibility  at  the  same  time. 

The  following  is  a summary  of  the  gradient  projection 
to  solve  the  above  maximization  problem: 

Step  1 Initialization:  Choose  70  6 M 
Let  n0  = G( 70) 

Step  2 Compute  da  = JG{~i0) 

Step  3 Compute  60  = P\idu 

Step  4 Unidimensional  maximization: 

Find  a > 0 such  that 

G(y0  + d*o))  = maxa>0(7(70  + a60)) 

Set  7)  = 70  + dS 0 
Step  5 Convergence  test: 

Let  V\  = G(7j) 

If  (17  — Vq)/vq  < ( then  stop 

else  70  ♦ — (1 

Vo  — V\ 

go  to  Step  2. 

On  exit,  7,  is  such  that  17  = (7(7])  is  an  estimate 
of  the  maximum  value  of  G.  Therefore  f(7/(7,)|0)  is  an 
estimate  of  the  maximum  value  of  1(tj\0). 

In  Step  2,  JG(~i0)  stands  for  the  array  of  partial 
derivatives  {OG(‘i0)/d'yl,  1 £ /}.  It  follows  from  (12) 
that,  for  i = (»!,...,*„)  g / and  c = (ri,...,rn), 

dG<  1 

Ki)t[{Xt  - Ac  - {\p{KT]t)  - ip{Kr]c)})(l  - 17.) 

+ X](A«'  “ V'( I< nc )})>?,'], 

i'el 

and, 


Kr}t[-  ^(A,<  - Ac  - ~ !/,(/0/c)})'h']> 

‘‘el 

where  4’  is  the  digamma  function  and  A,  = log#,,  t £ I. 

The  formulas  to  compute  the  projection  P\fdo  in  Step 
3 are  derived  in  a similar  fashion  to  the  formulas  to 
compute  fitted  values  of  the  cell  means  in  ANOVA. 
However,  these  formulas  are  not  the  same  because  the 
pararnetrization  for  7 does  not  involve  the  constant  term 
of  ANOVA  parametrizations. 

The  existence  cf  d in  Step  4 is  guaranteed  by  the  con- 
cavity of  the  Dirichlet  likelihood . We  used  routine  e04abf 
from  the  NAg  library  for  the  unidimensional  maximiza- 
tions. Although  it  >vould  take  more  programming,  per- 
haps an  algorithm  that  uses  the  derivative  of  G(~t0+a6o) 
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with  respect  to  a would  be  more  efficient  for  the  unidi- 
mensional  maximizations. 

It  is  possible  to  use  other  convergence  tests  in  Step  5. 
Since  our  interest  here  is  not  on  the  maximizer  7,  but 
on  the  maximum  value  G(~f),  it  is  appropriate  to  use 
the  test  in  Step  5 to  ensure  that  on  exit  7!  provides  a 
function  value  tq  sufficiently  close  to  G(7). 

4 Illustrative  Example 

In  this  section  we  reanalyze  the  2x2  table  given  in  Ta^ 
ble  1 which  classifies  college  students  with  respect  to 
their  political  affiliation  and  their  family  structure  (from 
Braungart  1971,  and  analyzed  in  Bishop,  Fienberg  and 
Holland,  197b,  pp  379-380),  and  by  Albert  and  Gupta 
(1984).  We  use  this  data  to  estimate  the  cell  proba- 
bilities using  the  prior  belief  that  the  two  variables  un- 
der study  are  plausibly  independent.  This  is  the  situa- 
lion  described  in  the  introduction.  For  illustrative  pur- 
poses we  use  normal  distributions  on  the  u-terms  in  the 
parametrization 

logVij  = u + «i|->  + u2(J).  (13) 

More  precisely,  we  use 

Slug  ’ I:  9\I\,V  ~ d(l\, »;),  with  i)tj  reparametrized  ac- 
cording to  equations  (13). 

Stage  ll:  The  u1(i)  are  independent,  i = 1,2.  The 
i<2( j j are  independent,  j = 1,2,  and  also  indepen- 
dent of  the  up,),  t = 1,2.  The  distribution  of 
»i(,)  is  N(nm )t<r,(.))  and  the  distribution  of  ) 

To  use  this  prior  density  one  first  specifies  the  param- 
eter vectors  /t,  = (/ii,  1 > , /Jj ( -_>) ),  and  er,  = (<t1(I),<t1(2)), 
rellecting  the  user’s  prior  knowledge  about  the  propor- 
tion of  students  in  the  two  political  affiliations  and, 
/G  = (/*'_><  1),  f*2(2)).  "2  = (f-n  reflecting 

the  user’s  prior  knowledge  about  the  proportion  of  stu- 
dents in  the  two  family  structures. 

In  this  example  we  set  /t,  = (5;.5),<t,  = (2.0;  2.0) 
and  fi3  — (,5;.5),<ra  = (2.0; 2.0),  reflecting  a rather 


I able  1 Parental  decision  making  and  political  affilia- 
lion.  Source:  Braungart(  1971 ). 
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imprecise  belief  about  the  u-terms.  Second,  one  specifies 
a value  for  the  parameter  A'. 

Albert  and  Gupta  (1982)  and  Epstein  and  Fienberg 
(1991)  computed  the  posterior  means  (5)  for  this  table 
but  they  used  different  distributions  to  reflect  uncertain 
prior  beliefs  about  independence.  In  both  articles  the 
posterior  expectation  of  the  rj's  were  estimated  using  a 
Monte  Carlo  method. 

Table  2 reports  the  computed  values  for  the  poste- 
rior means  of  each  of  the  cell  probabilities  for  several 
values  of  A’  (the  column  headed  by  K = 00  actually  cor 
responds  to  a very  large,  but  finite,  value  of  A ).  The 
estimates  corresponding  to  finite  values  of  K reflect  the 
uncertain  prior  belief  in  independence  by  compromising 
between  estimates  obtained  under  a saturated  model  and 
estimates  obtained  under  an  independence  model 

Figure  1 reports  reports  estimates  of  the  marginal  pos- 
terior densities  for  each  of  the  cell  probabilities.  These 
estimates  were  obtained  using  formula  (11)  for  the  pos- 
terior density  of  On  and  with  the  obvious  modifications 
for  the  other  cell  probabilies.  We  used  m = 20  indepen- 
dent replicates  and  each  of  the  replicates  was  generated 
with  t = 20  cycles  of  the  Gibbs  sampler.  In  addition  we 
computed  these  density  estimates  using  different  values 
of  in  and  t.  On  a plot  the  resulting  estimates  appeared 
to  be  fairly  similar  for  values  of  t and  m as  low  as  10. 


Table  2:  Computed  values  of  posterior  means  for  differ- 
ent values  of  A' 
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5 Discussion 

This  article  reports  on  an  implementation  of  the  Gibbs 
sampler  to  estimate  the  full  posterior  density  of  the  array 
of  cell  probabilities  of  n-way  contingency  tables  using 
the  method  proposed  by  Epstein  (1990)  and  Epstein  and 
Fienberg  ( 1 99 i ) . One  easily  obtains  estimates  of  the 
posterior  distributions  of  the  individual  cell  probabilities 
as  a finite  mixture  of  beta  densities. 

Gelfand  and  Smith  ( 1990)  proposed  the  Gibbs  sampler 
as  an  easy  to  implement  algorithm  to  generate  deviates 
from  posterior  distributions.  An  expeditious  implemen- 
tation requires  that  all  necessary  distributions  be  avail- 
able for  sampling.  This  was  not  the  case  in  this  article 
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and  we  expended  some  efforts  to  generate  deviates  from 

M*]- 

To  sample  from  [r/|0]  we  used  a scheme  that  does  not 
requires  the  marginal  density  [0],  which  is  often  the  main 
obstacle  to  compute  [t/|0],  The  scheme  uses  the  facts 
that  [T7]  is  available  for  sampling,  that  [^|rj]  as  a func- 
tion of  rj  can  be  viewed  as  a concave  likelihood  function 
with  a unique  maximum.  This  maximum  provides  the 
height  of  a box  for  a rejection  sampling  method.  The 
gradient  projection  method  proved  to  be  fast  and  very 
easy  to  program.  We  are  currently  investigating  its  use 
in  maximum  likelihood  estimation  for  generalized  linear 
models  and  will  report  on  this  work  elsewhere. 

Our  scheme  to  sample  from  [tj|0]  can  be  used  to  im- 
plement the  Gibbs  sampler  for  a variety  of  other  prob- 
lems involving  two-stage  priors  where  the  first  stage  is 
the  conjugate  prior  for  the  sampling  distribution  and  the 
second  stage  distribution  is  available  for  sampling. 

Furthermore,  we  feel  that  the  simplicity  of  the  Gibbs 
sampler  warrants  exploring  new  algorithms  to  generate 
deviates  from  distributions  that  thus  far  have  not  been 
available  for  sampling.  For  clarity  we  used  a simple  2x2 
example  to  illustrate  our  implementation. 

In  higher  dimensional  tables,  it  makes  special  sense  to 
utilize  the  structure  of  tj  in  terms  of  its  marginals  as  part 
of  the  algorithm  and  to  set  up  a cycle  involving  steps  for 
the  conditional  densities  for  each  of  the  marginals  of  t] 
instead  of  a single  step  for  [r/|0].  We  hope  to  report  on 
the  details  of  such  an  algorithm  at  a future  date. 
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